This homework analyzes a dataset of employees' personal information. Its purpose is to find all the possible features that might contribute to employee attrition by applying data-visualization techniques and an Association Rules model. In addition, a Shiny app will be created to present the visualization results of the Association Rules model.
The preprocessing steps are: import the dataset, convert data types, remove features with no or low variance, remove duplicates, check for and replace null values, and find and replace outliers.
Before analyzing the dataset, one should load all the necessary packages.
library(plyr)   # loaded before dplyr so that dplyr's summarise is not masked
library(dplyr)
library(arules)
library(knitr)
library(caret)
library(RANN)
library(corrplot)
library(rappdirs)
library(arulesViz)
library(ggplot2)
library(plotly)
library(tidyr)
library(purrr)
library(rsconnect)
library(shiny)
Importing the Dataset
my.dir <- getwd()
employee_attrition <- read.csv(paste0(my.dir, "/","employee_attrition.csv"), header = TRUE, stringsAsFactors = FALSE)
Checking the data structure.
str(employee_attrition)
## 'data.frame': 1176 obs. of 35 variables:
## $ Age : int 30 52 42 55 35 51 42 23 38 27 ...
## $ Attrition : chr "No" "No" "No" "No" ...
## $ BusinessTravel : chr "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" "Non-Travel" ...
## $ DailyRate : int 1358 1325 462 177 1029 1318 932 507 1153 1420 ...
## $ Department : chr "Sales" "Research & Development" "Sales" "Research & Development" ...
## $ DistanceFromHome : int 16 11 14 8 16 26 1 20 6 2 ...
## $ Education : int 1 4 2 1 3 4 2 1 2 1 ...
## $ EducationField : chr "Life Sciences" "Life Sciences" "Medical" "Medical" ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1479 813 936 1278 1529 851 827 1533 1782 667 ...
## $ EnvironmentSatisfaction : int 4 4 3 4 4 1 4 1 4 3 ...
## $ Gender : chr "Male" "Female" "Female" "Male" ...
## $ HourlyRate : int 96 82 68 37 91 66 43 97 40 85 ...
## $ JobInvolvement : int 3 3 2 2 2 3 2 3 2 3 ...
## $ JobLevel : int 2 2 2 4 3 4 2 2 1 1 ...
## $ JobRole : chr "Sales Executive" "Laboratory Technician" "Sales Executive" "Healthcare Representative" ...
## $ JobSatisfaction : int 3 3 3 2 2 3 4 3 3 1 ...
## $ MaritalStatus : chr "Married" "Married" "Single" "Divorced" ...
## $ MonthlyIncome : int 5301 3149 6244 13577 8606 16307 6062 2272 3702 3041 ...
## $ MonthlyRate : int 2939 21821 7824 25592 21195 5594 4051 24812 16376 16346 ...
## $ NumCompaniesWorked : int 8 8 7 1 1 2 9 0 1 0 ...
## $ Over18 : chr "Y" "Y" "Y" "Y" ...
## $ OverTime : chr "No" "No" "No" "Yes" ...
## $ PercentSalaryHike : int 15 20 17 15 19 14 13 14 11 11 ...
## $ PerformanceRating : int 3 4 3 3 3 3 3 3 3 NA ...
## $ RelationshipSatisfaction: int 3 2 1 4 4 3 4 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 2 1 0 1 0 1 1 0 1 1 ...
## $ TotalWorkingYears : int 4 9 10 34 11 29 8 5 5 5 ...
## $ TrainingTimesLastYear : int 2 3 6 3 3 2 4 2 3 3 ...
## $ WorkLifeBalance : int 2 3 3 3 1 2 3 3 3 3 ...
## $ YearsAtCompany : int 2 5 5 33 11 20 4 4 5 4 ...
## $ YearsInCurrentRole : int 1 2 4 9 8 6 3 3 4 3 ...
## $ YearsSinceLastPromotion : int 2 1 0 15 3 4 0 1 0 0 ...
## $ YearsWithCurrManager : int 2 4 3 0 3 17 2 2 4 2 ...
summary(employee_attrition)
## Age Attrition BusinessTravel DailyRate
## Min. :18.00 Length:1176 Length:1176 Min. : 102.0
## 1st Qu.:30.00 Class :character Class :character 1st Qu.: 461.8
## Median :36.00 Mode :character Mode :character Median : 796.0
## Mean :36.96 Mean : 800.4
## 3rd Qu.:43.00 3rd Qu.:1162.0
## Max. :60.00 Max. :1499.0
##
## Department DistanceFromHome Education EducationField
## Length:1176 Min. : 1.000 Min. :1.000 Length:1176
## Class :character 1st Qu.: 2.000 1st Qu.:2.000 Class :character
## Mode :character Median : 7.000 Median :3.000 Mode :character
## Mean : 9.496 Mean :2.895
## 3rd Qu.: 14.000 3rd Qu.:4.000
## Max. :224.000 Max. :5.000
## NA's :2
## EmployeeCount EmployeeNumber EnvironmentSatisfaction Gender
## Min. :1 Min. : 1.0 Min. :1.000 Length:1176
## 1st Qu.:1 1st Qu.: 499.8 1st Qu.:2.000 Class :character
## Median :1 Median :1032.5 Median :3.000 Mode :character
## Mean :1 Mean :1036.4 Mean :2.705
## 3rd Qu.:1 3rd Qu.:1574.5 3rd Qu.:4.000
## Max. :1 Max. :2068.0 Max. :4.000
##
## HourlyRate JobInvolvement JobLevel JobRole
## Min. : 30.00 Min. :1.000 Min. :1.000 Length:1176
## 1st Qu.: 48.00 1st Qu.:2.000 1st Qu.:1.000 Class :character
## Median : 66.00 Median :3.000 Median :2.000 Mode :character
## Mean : 65.82 Mean :2.741 Mean :2.069
## 3rd Qu.: 83.00 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :100.00 Max. :4.000 Max. :5.000
## NA's :1
## JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## Min. :1.00 Length:1176 Min. : 1009 Min. : 2094
## 1st Qu.:2.00 Class :character 1st Qu.: 2954 1st Qu.: 8275
## Median :3.00 Mode :character Median : 4950 Median :14488
## Mean :2.71 Mean : 6526 Mean :14468
## 3rd Qu.:4.00 3rd Qu.: 8354 3rd Qu.:20627
## Max. :4.00 Max. :19973 Max. :26999
##
## NumCompaniesWorked Over18 OverTime
## Min. :0.000 Length:1176 Length:1176
## 1st Qu.:1.000 Class :character Class :character
## Median :2.000 Mode :character Mode :character
## Mean :2.709
## 3rd Qu.:4.000
## Max. :9.000
##
## PercentSalaryHike PerformanceRating RelationshipSatisfaction
## Min. :11.0 Min. :3.000 Min. :1.000
## 1st Qu.:12.0 1st Qu.:3.000 1st Qu.:2.000
## Median :14.0 Median :3.000 Median :3.000
## Mean :15.3 Mean :3.163 Mean :2.718
## 3rd Qu.:18.0 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :25.0 Max. :4.000 Max. :4.000
## NA's :1 NA's :1 NA's :1
## StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear
## Min. :80 Min. :0.0000 Min. : 0.0 Min. :0.00
## 1st Qu.:80 1st Qu.:0.0000 1st Qu.: 6.0 1st Qu.:2.00
## Median :80 Median :1.0000 Median : 10.0 Median :3.00
## Mean :80 Mean :0.7959 Mean : 11.4 Mean :2.81
## 3rd Qu.:80 3rd Qu.:1.0000 3rd Qu.: 15.0 3rd Qu.:3.00
## Max. :80 Max. :3.0000 Max. :114.0 Max. :6.00
## NA's :2
## WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000
## Median :3.000 Median : 5.000 Median : 3.000
## Mean :2.747 Mean : 6.918 Mean : 4.151
## 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000
## Max. :4.000 Max. :40.000 Max. :18.000
##
## YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 1.000 Median : 3.000
## Mean : 2.125 Mean : 4.242
## 3rd Qu.: 2.000 3rd Qu.: 7.000
## Max. :15.000 Max. :219.000
## NA's :1
Feature Management
Removing features with no or low variance. nearZeroVar flags EmployeeCount, Over18, and StandardHours as zero-variance, so these three columns are dropped.
nzv <- nearZeroVar(employee_attrition, saveMetrics = TRUE)
employee_attrition <- employee_attrition[, c(-9, -22, -27)]  # EmployeeCount, Over18, StandardHours
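Positional indices like c(-9, -22, -27) silently break if the column order ever changes. A minimal sketch (on a hypothetical three-column toy frame, since the full dataset is not reproduced here) of dropping the flagged columns by name instead:

```r
library(caret)

# Toy stand-in for employee_attrition: one constant (zero-variance) column.
df <- data.frame(Age = c(30, 52, 42),
                 Attrition = c("No", "No", "Yes"),
                 EmployeeCount = c(1, 1, 1))

nzv <- nearZeroVar(df, saveMetrics = TRUE)
drop_cols <- rownames(nzv)[nzv$zeroVar]            # "EmployeeCount"
df <- df[, !(names(df) %in% drop_cols)]
names(df)                                          # "Age" "Attrition"
```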
Duplicate Management
ifelse(nrow(employee_attrition) == nrow(employee_attrition[!duplicated(employee_attrition),]), 'The dataset does not have duplicate', 'The dataset has duplicate')
## [1] "The dataset does not have duplicate"
Missing values management
Checking for missing value
sum(!complete.cases(employee_attrition))
## [1] 9
There are 9 rows containing missing values in the dataset.
row <- which(apply(employee_attrition, 1, function(x) sum(is.na(x))) > 0)
col <- which(apply(employee_attrition, 2, function(x) sum(is.na(x))) > 0)
print(as.data.frame(unique(employee_attrition[row,col])))
## DistanceFromHome JobLevel PercentSalaryHike PerformanceRating
## 10 2 1 11 NA
## 33 6 2 12 3
## 55 NA 4 13 3
## 64 6 2 NA 3
## 83 1 1 17 3
## 89 16 NA 19 3
## 105 NA 2 14 3
## 969 2 4 11 3
## 994 15 2 24 4
## RelationshipSatisfaction TotalWorkingYears YearsSinceLastPromotion
## 10 2 5 0
## 33 NA 17 1
## 55 4 21 15
## 64 1 20 1
## 83 4 NA 0
## 89 3 9 0
## 105 2 10 1
## 969 4 NA 11
## 994 1 15 NA
Replacing the missing values
In this process, the rounded mean of the corresponding column is used to replace each missing value. Given that the replacement process is identical for every affected column, only one is displayed as an example.
employee_attrition$DistanceFromHome[is.na(employee_attrition$DistanceFromHome)] <- round(mean(employee_attrition$DistanceFromHome, na.rm = T))
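Since the replacement is the same for every affected column, the repeated statements can be wrapped in a helper; a minimal sketch on a toy data frame (column names mirror the dataset, values are made up):

```r
# Replace the NAs in every numeric column with the rounded column mean.
impute_mean <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]]) && anyNA(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- round(mean(df[[col]], na.rm = TRUE))
    }
  }
  df
}

toy <- data.frame(DistanceFromHome = c(16, NA, 14),
                  JobLevel = c(2, 2, NA))
toy <- impute_mean(toy)
toy$DistanceFromHome   # 16 15 14
```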
Besides the NA values, there are also two empty strings. Hence, one should find and remove these rows.
which(apply(employee_attrition, 1, function(x) sum(x == "")) > 0)
which(apply(employee_attrition, 2, function(x) sum(x == "")) > 0)
employee_attrition <- employee_attrition[c(-1013,-1062),]
Outlier Management
Before managing outliers, one should convert all integer columns to numeric, encode character columns as factors, and then visualize the numeric columns to find those with outliers.
num_var <- sapply(employee_attrition, is.integer)
employee_attrition[, num_var] <- lapply(employee_attrition[, num_var], as.numeric)
char_var <- sapply(employee_attrition, is.character)
employee_attrition[, char_var] <- lapply(employee_attrition[, char_var], as.factor)
employee_attrition %>% keep(is.numeric) %>%
gather() %>%
ggplot(aes(y = value)) +
facet_wrap(~key, scales = "free") +
geom_boxplot(fill = "orange") +
labs(title = "The Boxplots of Numeric Columns")
From this graph, one can identify the columns with outliers:
* DistanceFrom Home
* MonthlyIncome
* …
* YearsWithCurrManager
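Instead of reading the affected columns off the boxplots by eye, the same check can be done programmatically; a sketch on made-up values using boxplot.stats(), the same outlier rule applied below:

```r
# Count the values flagged as boxplot outliers (beyond 1.5 * IQR) per numeric column.
count_outliers <- function(df) {
  nums <- df[sapply(df, is.numeric)]
  sapply(nums, function(x) length(boxplot.stats(x)$out))
}

toy <- data.frame(DistanceFromHome = c(rep(5, 20), 224),  # one extreme value
                  Age = 20:40)
count_outliers(toy)   # DistanceFromHome: 1, Age: 0
```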
Even though several columns contain outliers, only three of them will be treated. The reason is that some outliers are plausible in the real world. For example, the maximum of "YearsAtCompany" (which is 40) is flagged as an outlier, yet such a tenure is entirely possible.
By contrast, extreme values roughly ten times larger than the rest of the data, such as the maximum of "DistanceFromHome", do not make sense at all. Hence these outliers should be replaced.
outlier_cols <- c("DistanceFromHome", "TotalWorkingYears", "YearsWithCurrManager")
for (col in outlier_cols) {
  # Replace boxplot outliers with the rounded column median.
  out <- boxplot.stats(employee_attrition[[col]])$out
  employee_attrition[[col]][employee_attrition[[col]] %in% out] <-
    round(median(employee_attrition[[col]], na.rm = TRUE))
}
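Replacing outliers with the median discards their relative ordering. An alternative worth considering (a sketch only, not what this homework applies) is to cap extreme values at the boxplot whiskers:

```r
# Winsorize: clamp values beyond the boxplot whiskers to the whisker ends.
cap_outliers <- function(x) {
  stats <- boxplot.stats(x)$stats   # whisker-low, Q1, median, Q3, whisker-high
  pmin(pmax(x, stats[1]), stats[5])
}

x <- c(1, 4, 7, 8, 10, 12, 224)
cap_outliers(x)   # 224 is capped at 12, the upper whisker
```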
Data Type Conversion
In this section, to analyze the dataset more efficiently, numeric data will be converted to factor data via discretization techniques.
Given that these processes are repetitive, only four conversions are shown here.
employee_attrition$Age_Group <- arules::discretize(employee_attrition$Age, method = "frequency", breaks = 3, labels = c("young", "mid", "elder"))
employee_attrition$DRate_Group <- arules::discretize(employee_attrition$DailyRate, method = "frequency", breaks = 3, labels = c("low", "medium", "high"))
employee_attrition$DistanceFromHome_Group <- arules::discretize(employee_attrition$DistanceFromHome, method = "frequency", breaks = 3, labels = c("short", "medium", "long"))
employee_attrition$Education_Group <- as.factor(employee_attrition$Education)
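arules::discretize() also supports interval, cluster-based, and fixed binning; a brief sketch contrasting equal-frequency bins with explicit cut points (the ages are hypothetical):

```r
library(arules)

ages <- c(18, 25, 31, 36, 44, 52, 60)

# Equal-frequency bins (as used above) versus user-chosen cut points.
freq_bins  <- discretize(ages, method = "frequency", breaks = 3,
                         labels = c("young", "mid", "elder"))
fixed_bins <- discretize(ages, method = "fixed", breaks = c(18, 30, 45, Inf),
                         labels = c("young", "mid", "elder"))
table(fixed_bins)   # young: 2, mid: 3, elder: 2
```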
Final Presentation of the Dataset
Checking the details of the dataset and saving it to the local directory.
employee_factors <- employee_attrition[, sapply(employee_attrition, is.factor)]
str(employee_factors)
## 'data.frame': 1174 obs. of 32 variables:
## $ Attrition : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 2 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 3 1 3 3 3 3 3 3 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 3 2 2 3 2 2 2 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 4 4 2 3 2 2 5 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 1 1 2 1 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 3 8 1 1 4 5 3 3 9 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 2 2 3 1 3 2 2 3 2 1 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 2 1 1 1 ...
## $ Age_Group : Factor w/ 3 levels "young","mid",..: 1 3 3 3 2 3 3 1 2 1 ...
## ..- attr(*, "discretized:breaks")= num 18 32 40 60
## ..- attr(*, "discretized:method")= chr "frequency"
## $ DRate_Group : Factor w/ 3 levels "low","medium",..: 3 3 1 1 2 3 2 1 3 3 ...
## ..- attr(*, "discretized:breaks")= num 102 573 1040 1499
## ..- attr(*, "discretized:method")= chr "frequency"
## $ DistanceFromHome_Group : Factor w/ 3 levels "short","medium",..: 3 3 3 2 3 3 1 3 2 1 ...
## ..- attr(*, "discretized:breaks")= num 1 4 10 29
## ..- attr(*, "discretized:method")= chr "frequency"
## $ Education_Group : Factor w/ 5 levels "1","2","3","4",..: 1 4 2 1 3 4 2 1 2 1 ...
## $ EmployeeNumber_Group : Factor w/ 3 levels "low","medium",..: 3 2 2 2 3 2 2 3 3 1 ...
## ..- attr(*, "discretized:breaks")= num 1 684 1390 2068
## ..- attr(*, "discretized:method")= chr "frequency"
## $ EnvironmentSat_Group : Factor w/ 2 levels "unsatisfactory",..: 2 2 2 2 2 1 2 1 2 2 ...
## ..- attr(*, "discretized:breaks")= num 1 3 4
## ..- attr(*, "discretized:method")= chr "frequency"
## $ HourlyRate_Group : Factor w/ 3 levels "low","medium",..: 3 3 2 1 3 2 1 3 1 3 ...
## ..- attr(*, "discretized:breaks")= num 30 54 78 100
## ..- attr(*, "discretized:method")= chr "frequency"
## $ JobInvolvement_Group : Factor w/ 4 levels "1","2","3","4": 3 3 2 2 2 3 2 3 2 3 ...
## $ JobLevel_Group : Factor w/ 5 levels "1","2","3","4",..: 2 2 2 4 3 4 2 2 1 1 ...
## $ JobSatisfaction_Group : Factor w/ 2 levels "unsatisfactory",..: 2 2 2 1 1 2 2 2 2 1 ...
## ..- attr(*, "discretized:breaks")= num 1 3 4
## ..- attr(*, "discretized:method")= chr "frequency"
## $ MonthlyIncome_Group : Factor w/ 3 levels "low","medium",..: 2 1 2 3 3 3 2 1 2 1 ...
## ..- attr(*, "discretized:breaks")= num 1009 3692 6524 19973
## ..- attr(*, "discretized:method")= chr "frequency"
## $ MonthlyRate_Group : Factor w/ 3 levels "low","medium",..: 1 3 1 3 3 1 1 3 2 2 ...
## ..- attr(*, "discretized:breaks")= num 2094 10227 18779 26999
## ..- attr(*, "discretized:method")= chr "frequency"
## $ NumCompaniesWorked_Group : Factor w/ 3 levels "small","medium",..: 3 3 3 2 2 2 3 1 2 1 ...
## ..- attr(*, "discretized:breaks")= num 0 1 3 9
## ..- attr(*, "discretized:method")= chr "frequency"
## $ PercentSalaryHike_Group : Factor w/ 4 levels "[11,12)","[12,14)",..: 3 4 3 3 4 3 2 3 1 1 ...
## ..- attr(*, "discretized:breaks")= num 11 12 14 18 25
## ..- attr(*, "discretized:method")= chr "frequency"
## $ PerformanceRating_Group : Factor w/ 2 levels "3","4": 1 2 1 1 1 1 1 1 1 1 ...
## $ RelationshipSatisfaction_Group: Factor w/ 4 levels "1","2","3","4": 3 2 1 4 4 3 4 2 2 2 ...
## $ StockOptionLevel_Group : Factor w/ 4 levels "0","1","2","3": 3 2 1 2 1 2 2 1 2 2 ...
## $ TotalWorkingYears_Group : Factor w/ 4 levels "[0,6)","[6,10)",..: 1 2 3 3 3 3 2 1 1 1 ...
## ..- attr(*, "discretized:breaks")= num 0 6 10 13 28
## ..- attr(*, "discretized:method")= chr "frequency"
## $ TrainingTimesLastYear_Group : Factor w/ 3 levels "[0,2)","[2,3)",..: 2 3 3 3 3 2 3 2 3 3 ...
## ..- attr(*, "discretized:breaks")= num 0 2 3 6
## ..- attr(*, "discretized:method")= chr "frequency"
## $ WorkLifeBalance_Group : Factor w/ 4 levels "1","2","3","4": 2 3 3 3 1 2 3 3 3 3 ...
## $ YearsAtCompany_Group : Factor w/ 4 levels "[0,3)","[3,5)",..: 1 3 3 4 4 4 2 2 3 2 ...
## ..- attr(*, "discretized:breaks")= num 0 3 5 9 40
## ..- attr(*, "discretized:method")= chr "frequency"
## $ YearsInCurrentRole_Group : Factor w/ 4 levels "[0,2)","[2,3)",..: 1 2 3 4 4 3 3 3 3 3 ...
## ..- attr(*, "discretized:breaks")= num 0 2 3 7 18
## ..- attr(*, "discretized:method")= chr "frequency"
## $ YearsSinceLastPromotion_Group : Factor w/ 2 levels "[0,1)","[1,15]": 2 2 1 2 2 2 1 2 1 1 ...
## ..- attr(*, "discretized:breaks")= num 0 1 15
## ..- attr(*, "discretized:method")= chr "frequency"
## $ YearsWithCurrManager_Group : Factor w/ 4 levels "[0,2)","[2,3)",..: 2 3 3 1 3 3 2 2 3 2 ...
## ..- attr(*, "discretized:breaks")= num 0 2 3 7 14
## ..- attr(*, "discretized:method")= chr "frequency"
write.csv(employee_factors,'employee_factors.csv')
Conduct exploratory data analysis (EDA): derive descriptive statistics and apply data visualization to check for interesting data patterns.
Based on Attrition (Yes and No), calculate the mean value of each numeric column and compare the results between the two groups.
Yes <- as.data.frame(employee_attrition %>%
group_by(Attrition) %>%
filter(Attrition=='Yes') %>%
summarise(m_Age = mean(Age),
m_Drate = mean(DailyRate),
m_Distance = mean(DistanceFromHome),
m_Edu = mean(Education),
m_Envir_Sat = mean(EnvironmentSatisfaction),
m_JobEnvolve = mean(JobInvolvement),
m_JobLevel = mean(JobLevel),
m_JobSatis = mean(JobSatisfaction),
m_PerfRate = mean(PerformanceRating),
m_StockOption = mean(StockOptionLevel),
m_RelationSat = mean(RelationshipSatisfaction),
m_NumberCompanyWork = mean(NumCompaniesWorked),
m_TotalWorkYear = mean(TotalWorkingYears),
m_Worklife_Balance = mean(WorkLifeBalance),
m_TrainTime_LastYear = mean(TrainingTimesLastYear),
m_Work_Company = mean(YearsAtCompany),
m_CurrentPos_Year = mean(YearsInCurrentRole),
m_Promotion_Year = mean(YearsSinceLastPromotion),
m_CurrentManager_Year = mean(YearsWithCurrManager)))
## m_Age m_Drate m_Distance m_Edu m_Envir_Sat m_JobEnvolve
## 1 3.925396 73.25117 -2.270155 0.002951384 0.3292925 0.2285738
## m_JobLevel m_JobSatis m_PerfRate m_StockOption m_RelationSat
## 1 0.47467 0.4025579 -0.02503758 0.2888913 0.07395403
## m_NumberCompanyWork m_TotalWorkYear m_Worklife_Balance
## 1 -0.3505097 3.214653 0.1306151
## m_TrainTime_LastYear m_Work_Company m_CurrentPos_Year m_Promotion_Year
## 1 0.2449485 2.080196 1.538961 0.4323231
## m_CurrentManager_Year
## 1 1.3953
After comparing the results, one should eliminate the columns that show little difference between the two groups.
employee_attrition %>%
group_by(Attrition) %>%
summarise(m_Age = mean(Age),
m_Drate = mean(DailyRate),
m_Distance = mean(DistanceFromHome),
m_Joblevel = mean(JobLevel),
m_Total_work_year = mean(TotalWorkingYears),
m_Year_at_Company = mean(YearsAtCompany),
m_Year_current_position = mean(YearsInCurrentRole),
m_Year_promotion = mean(YearsSinceLastPromotion),
m_Year_current_manager = mean(YearsWithCurrManager))
## m_Age m_Drate m_Distance m_Joblevel m_Total_work_year
## 1 36.92845 800.3893 9.309199 2.064736 10.29727
## m_Year_at_Company m_Year_current_position m_Year_promotion
## 1 6.903748 4.139693 2.120954
## m_Year_current_manager
## 1 3.910562
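The long per-column summaries above can be written more compactly with dplyr's across(); a sketch on a toy frame (column names mirror the dataset, values are made up):

```r
library(dplyr)

toy <- data.frame(Attrition = c("Yes", "No", "Yes", "No"),
                  Age = c(25, 40, 30, 45),
                  YearsAtCompany = c(1, 10, 2, 8))

# Mean of every numeric column within each Attrition group.
group_means <- toy %>%
  group_by(Attrition) %>%
  summarise(across(where(is.numeric), mean))
group_means
```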
cor_matrix <- cor(employee_attrition[complete.cases(employee_attrition), sapply(employee_attrition, is.numeric)], method = "pearson")
corrplot(cor_matrix,type = 'upper', tl.col = "black")
employee_attrition %>% keep(is.numeric) %>%
gather() %>%
ggplot(aes(x = value)) +
facet_wrap(~key, scales = "free") +
geom_histogram(bins = 15, fill = "lightblue", color = "white") +
labs(title = "The Histogram of Numeric Columns")
Histogram of the Age Variable
Summary: The mean age of employees who leave is lower than that of employees who stay. Moreover, the age distribution of the "stay" group is approximately normal, while that of the "leave" group is right-skewed, implying that young employees are more willing to leave the company.
cdat <- ddply(employee_attrition, "Attrition", summarise, Age.mean = mean(Age))
p<-ggplot(employee_attrition, aes(x = Age, fill = Attrition)) +
geom_histogram(binwidth = 3, alpha = .5, position = "identity")+
geom_vline(data = cdat, aes(xintercept = Age.mean),
linetype = "dashed", size = 1, color = c("pink", "lightblue"))
p<-ggplotly(p)
p
Histogram of the DailyRate Variable
cdat <- ddply(employee_attrition, "Attrition", summarise, DailyRate.mean = mean(DailyRate))
p<-ggplot(employee_attrition, aes(x = DailyRate, fill = Attrition)) +
geom_histogram(binwidth = 100, alpha = .5, position = "identity")+
geom_vline(data = cdat, aes(xintercept = DailyRate.mean),
linetype = "dashed", size = 1, color = c("pink", "lightblue"))
p<-ggplotly(p)
p
Barplot of the DistanceFromHome Variable
cdat <- as.data.frame(ddply(employee_attrition, c("DistanceFromHome_Group", "Attrition"), summarise, Attrition.Number = length(Attrition)))
p<-ggplot(cdat, aes(x = DistanceFromHome_Group, y = Attrition.Number, fill = Attrition)) +
geom_bar(alpha = .5, stat = "identity", position = position_dodge())
p<-ggplotly(p)
p
Barplot of the JobLevel Variable
cdat <- as.data.frame(ddply(employee_attrition, c("JobLevel_Group", "Attrition"), summarise, Attrition.Number = length(Attrition)))
p<-ggplot(cdat, aes(x = JobLevel_Group, y = Attrition.Number, fill = Attrition)) +
geom_bar(alpha = .5, stat = "identity", position = position_dodge())
p<-ggplotly(p)
p
Barplot of the Total Working Years Variable
cdat <- as.data.frame(ddply(employee_attrition, c("TotalWorkingYears_Group", "Attrition"), summarise, Attrition.Number = length(Attrition)))
p<-ggplot(cdat, aes(x = TotalWorkingYears_Group, y = Attrition.Number, fill = Attrition)) +
geom_bar(alpha = .5, stat = "identity", position = position_dodge()) +
geom_hline(yintercept = mean(employee_attrition$TotalWorkingYears[employee_attrition$Attrition == 'Yes']), col = "lightblue") +
geom_hline(yintercept = mean(employee_attrition$TotalWorkingYears[employee_attrition$Attrition == 'No']), col = "pink")
p<-ggplotly(p)
p
Barplot of the Years at Company Variable
cdat <- as.data.frame(ddply(employee_attrition, c("YearsAtCompany_Group", "Attrition"), summarise, Attrition.Number = length(Attrition)))
p<-ggplot(cdat, aes(x = YearsAtCompany_Group, y = Attrition.Number, fill = Attrition)) +
geom_bar(alpha = .5, stat = "identity", position = position_dodge()) +
geom_hline(yintercept = mean(employee_attrition$YearsAtCompany[employee_attrition$Attrition == 'Yes']), col = "lightblue") +
geom_hline(yintercept = mean(employee_attrition$YearsAtCompany[employee_attrition$Attrition == 'No']), col = "pink")
p<-ggplotly(p)
p
Barplot of the Years in Current Role Variable
cdat <- as.data.frame(ddply(employee_attrition, c("YearsInCurrentRole_Group", "Attrition"), summarise, Attrition.Number = length(Attrition)))
p<-ggplot(cdat, aes(x = YearsInCurrentRole_Group, y = Attrition.Number, fill = Attrition)) +
geom_bar(alpha = .5, stat = "identity", position = position_dodge()) +
geom_hline(yintercept = mean(employee_attrition$YearsInCurrentRole[employee_attrition$Attrition == 'Yes']), col = "lightblue") +
geom_hline(yintercept = mean(employee_attrition$YearsInCurrentRole[employee_attrition$Attrition == 'No']), col = "pink")
p<-ggplotly(p)
p
Barplot of the Years since Last Promotion Variable
cdat <- as.data.frame(ddply(employee_attrition, c("YearsSinceLastPromotion_Group", "Attrition"), summarise, Attrition.Number = length(Attrition)))
p<-ggplot(cdat, aes(x = YearsSinceLastPromotion_Group, y = Attrition.Number, fill = Attrition)) +
geom_bar(alpha = .5, stat = "identity", position = position_dodge()) +
geom_hline(yintercept = mean(employee_attrition$YearsSinceLastPromotion[employee_attrition$Attrition == 'Yes']), col = "lightblue") +
geom_hline(yintercept = mean(employee_attrition$YearsSinceLastPromotion[employee_attrition$Attrition == 'No']), col = "pink")
p<-ggplotly(p)
p
Barplot of the Years with Current Manager Variable
cdat <- as.data.frame(ddply(employee_attrition, c("YearsWithCurrManager_Group", "Attrition"), summarise, Attrition.Number = length(Attrition)))
p<-ggplot(cdat, aes(x = YearsWithCurrManager_Group, y = Attrition.Number, fill = Attrition)) +
geom_bar(alpha = .5, stat = "identity", position = position_dodge()) +
geom_hline(yintercept = mean(employee_attrition$YearsWithCurrManager[employee_attrition$Attrition == 'Yes']), col = "lightblue") +
geom_hline(yintercept = mean(employee_attrition$YearsWithCurrManager[employee_attrition$Attrition == 'No']), col = "pink")
p<-ggplotly(p)
p
Utilizing association rules to create a baseline model (setting all the hyper-parameters to their default values)
factor_trans <- as(employee_factors[, sapply(employee_factors, is.factor)], "transactions")
baseline <- apriori(factor_trans)
Result of Baseline Model
inspect(head(sort(baseline, by = "lift", decreasing = T), 5))
## lhs rhs support confidence lift count
## [1] {YearsAtCompany_Group=[0,3),
## YearsSinceLastPromotion_Group=[0,1),
## YearsWithCurrManager_Group=[0,2)} => {YearsInCurrentRole_Group=[0,2)} 0.1328790 0.9873418 4.932507 156
## [2] {PerformanceRating_Group=3,
## YearsAtCompany_Group=[0,3),
## YearsSinceLastPromotion_Group=[0,1),
## YearsWithCurrManager_Group=[0,2)} => {YearsInCurrentRole_Group=[0,2)} 0.1090290 0.9846154 4.918887 128
## [3] {YearsAtCompany_Group=[0,3),
## YearsSinceLastPromotion_Group=[0,1)} => {YearsInCurrentRole_Group=[0,2)} 0.1345826 0.9294118 4.643104 158
## [4] {PerformanceRating_Group=3,
## YearsAtCompany_Group=[0,3),
## YearsSinceLastPromotion_Group=[0,1)} => {YearsInCurrentRole_Group=[0,2)} 0.1107325 0.9285714 4.638906 130
## [5] {BusinessTravel=Travel_Rarely,
## YearsAtCompany_Group=[0,3),
## YearsWithCurrManager_Group=[0,2)} => {YearsInCurrentRole_Group=[0,2)} 0.1013629 0.9083969 4.538119 119
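Beyond support, confidence, and lift, arules can attach further quality measures to mined rules via interestMeasure(); a sketch on a self-contained toy transaction set (the items are illustrative, not taken from the dataset):

```r
library(arules)

# Toy transactions standing in for factor_trans.
trans <- as(list(c("young", "leave"), c("young", "leave"),
                 c("elder", "stay"), c("elder", "stay"),
                 c("young", "stay")), "transactions")

rules <- apriori(trans, parameter = list(supp = 0.2, conf = 0.6, minlen = 2),
                 control = list(verbose = FALSE))

# Append the chi-squared statistic as an extra quality column.
quality(rules)$chiSq <- interestMeasure(rules, "chiSquared", trans)
inspect(head(sort(rules, by = "lift"), 3))
```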
Visualize the Baseline Model: Top 10 Item Frequency Plot
itemFrequencyPlot(factor_trans, topN = 10, type = "absolute", main = "Item Frequency")
plot(baseline, jitter = 0)
Summary and Analysis:
From the plot, one can observe that rules with lower support tend to show higher confidence and lift.
Setting the support to 0.01 means that a rule combination must appear in at least 1% of the transactions (roughly 12 of the ~1,170 rows).
Setting the confidence to 0.8 means that the relationship between the lhs and the rhs is strong enough to support their correlation.
Besides, the is.subset() function can help the programmer prune redundant rules.
rules_yes <- apriori(factor_trans,
parameter = list(support = 0.01, confidence = 0.8, minlen = 3),
appearance = list(default = "lhs", rhs = c("Attrition=Yes")),
control = list(verbose = F))
subset_rules <- which(colSums(is.subset(rules_yes, rules_yes)) > 1)
rules_yes <- rules_yes[-subset_rules]
result_yes <- head(sort(rules_yes, by = "lift", decreasing = T), 10)
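As an alternative to the is.subset() pruning above, arules provides is.redundant(), which drops any rule that has a more general version with at least the same confidence; a sketch on a toy transaction set:

```r
library(arules)

trans <- as(list(c("a", "b", "x"), c("a", "b", "x"),
                 c("a", "x"), c("b", "y")), "transactions")

rules <- apriori(trans, parameter = list(supp = 0.25, conf = 0.8, minlen = 2),
                 control = list(verbose = FALSE))

# Keep only rules with no more general rule of equal or higher confidence.
pruned <- rules[!is.redundant(rules)]
length(pruned) <= length(rules)   # TRUE
```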
Checking the Top 5 rules
inspect(head(sort(rules_yes, by = "lift", decreasing = T), 5))
## lhs rhs support confidence lift count
## [1] {BusinessTravel=Travel_Frequently,
## MaritalStatus=Single,
## Age_Group=young,
## YearsAtCompany_Group=[0,3)} => {Attrition=Yes} 0.01192504 1 6.345946 14
## [2] {BusinessTravel=Travel_Frequently,
## Age_Group=young,
## StockOptionLevel_Group=0,
## YearsAtCompany_Group=[0,3)} => {Attrition=Yes} 0.01277683 1 6.345946 15
## [3] {MaritalStatus=Single,
## Age_Group=young,
## MonthlyRate_Group=medium,
## YearsInCurrentRole_Group=[0,2)} => {Attrition=Yes} 0.01022147 1 6.345946 12
## [4] {MaritalStatus=Single,
## Age_Group=young,
## MonthlyRate_Group=medium,
## TotalWorkingYears_Group=[0,6),
## YearsSinceLastPromotion_Group=[0,1)} => {Attrition=Yes} 0.01022147 1 6.345946 12
## [5] {OverTime=Yes,
## Age_Group=young,
## JobSatisfaction_Group=unsatisfactory,
## MonthlyIncome_Group=low,
## YearsSinceLastPromotion_Group=[0,1)} => {Attrition=Yes} 0.01022147 1 6.345946 12
Visualize those rules
plot(result_yes, method = "paracoord", control = list(reorder = T))
plot(rules_yes, measure = c("support", "lift"), shading = "confidence")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(result_yes, method = "graph")
Setting the support to 0.1 means that a rule combination must appear in at least 10% of the transactions (roughly 117 rows).
Setting the confidence to 0.9 means that the relationship between the lhs and the rhs is strong enough to support their correlation.
Besides, the is.subset() function can help the programmer prune redundant rules.
rules_no <- apriori(factor_trans,
parameter = list(support = 0.1, confidence = 0.90, minlen = 4),
appearance = list(default = "lhs", rhs = c("Attrition=No")),
control = list(verbose = F))
subset_rules <- which(colSums(is.subset(rules_no, rules_no)) > 1)
rules_no <- rules_no[-subset_rules]
result_no <- head(sort(rules_no, by = "lift", decreasing = T), 15)
The Top 5 rules
inspect(head(sort(rules_no, by = "lift", decreasing = T), 5))
## lhs rhs support confidence lift count
## [1] {OverTime=No,
## JobSatisfaction_Group=satisfactory,
## YearsAtCompany_Group=[5,9)} => {Attrition=No} 0.1286201 0.9805195 1.163933 151
## [2] {OverTime=No,
## JobLevel_Group=2,
## JobSatisfaction_Group=satisfactory} => {Attrition=No} 0.1567291 0.9684211 1.149572 184
## [3] {Department=Research & Development,
## MonthlyIncome_Group=high,
## TrainingTimesLastYear_Group=[3,6]} => {Attrition=No} 0.1022147 0.9677419 1.148765 120
## [4] {OverTime=No,
## YearsAtCompany_Group=[5,9),
## YearsSinceLastPromotion_Group=[1,15]} => {Attrition=No} 0.1149915 0.9642857 1.144663 135
## [5] {OverTime=No,
## JobSatisfaction_Group=satisfactory,
## MonthlyIncome_Group=medium} => {Attrition=No} 0.1371380 0.9640719 1.144409 161
Visualize the rules
plot(result_no, method = "paracoord", control = list(reorder = T))
plot(rules_no, measure = c("support", "lift"), shading = "confidence")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
Visualize the rules
plot(result_no, method = "graph")
For the Association Rules Model:
One should run the model several times with different hyper-parameters in order to optimize the results. Even though it may seem that higher support and confidence values always produce better results, the appropriate choice depends on the analyst's purpose. Besides, one should avoid extreme settings (such as a support of 1.0), which might cause an overfitting problem.
According to the analysis illustrated above, several factors contribute to the "Attrition=Yes" outcome:
* BusinessTravel = Travel_Frequently
* Age_Group = Young
* JobLevel_Group = 1
* OverTime = Yes
* JobSatisfaction_Group = Unsatisfactory
* YearsAtCompany_Group = [0,3)
* StockOptionLevel_Group = 0
Hence, one can summarize that newly enrolled, young employees who travel frequently and often work overtime tend to leave the company, especially when they feel unsatisfied with their work and hold limited stock options.
For the "Attrition=No" group, the significant factors are:
* OverTime = No
* JobSatisfaction_Group = Satisfactory
* YearsAtCompany_Group = [5,9)
* StockOptionLevel_Group = 1
Hence, employees who do not work overtime and are satisfied with their jobs tend not to leave the company, especially those who hold stock options and have worked at the company for more than five years.
Recommendation
In this situation, the manager can investigate the features that negatively influence employees and then launch activities to reduce the turnover rate. For example, the manager can take the initiative to reduce the overtime rate, especially for young employees.
Finally
This is the end of this presentation. If you want to find more interesting data patterns, you can click the link below to tailor your own Association Rules model. Hope you enjoy it!
Shiny App web link
Reminder: Please be patient with the refreshing speed of this App.